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Abstract 

Background: Protein sequence evolution is constrained by the biophysics of folding and function, causing 
interdependence between interacting sites in the sequence. However, current site-independent models of 
sequence evolutions do not take this into account. Recent attempts to integrate the influence of structure and 
biophysics into phylogenetic models via statistical/informational approaches have not resulted in expected 
improvements in model performance. This suggests that further innovations are needed for progress in this field. 

Results: Here we develop a coarse-grained physics-based model of protein folding and binding function, and 
compare it to a popular informational model. We find that both models violate the assumption of the native 
sequence being close to a thermodynamic optimum, causing directional selection away from the native state. 
Sampling and simulation show that the physics-based model is more specific for fold-defining interactions that 
vary less among residue type. The informational model diffuses further in sequence space with fewer barriers and 
tends to provide less support for an invariant sites model, although amino acid substitutions are generally 
conservative. Both approaches produce sequences with natural features like dN/dS < 1 and gamma-distributed 
rates across sites. 

Conclusions: Simple coarse-grained models of protein folding can describe some natural features of evolving 
proteins but are currently not accurate enough to use in evolutionary inference. This is partly due to improper 
packing of the hydrophobic core. We suggest possible improvements on the representation of structure, folding 
energy, and binding function, as regards both native and non-native conformations, and describe a large number 
of possible applications for such a model. 



Background 

Protein-coding sequences play a central role in cellular 
functions necessary to produce the vast variation in 
organismal phenotypes observed in nature. To function, 
most proteins must fold into a unique and stable struc- 
ture [1]. The structure is responsible for orienting resi- 
dues necessary for proper function, including binding 
and catalysis. To maintain fitness, protein function and 
the underlying structure must be preserved. Protein 
structure and binding (for example, protein-ligand and 
protein-protein interactions) are determined by the 
interactions of individual amino acid residues. There- 
fore, to mechanistically model the evolution of protein 
sequences, these residue-residue interactions must be 
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considered, relaxing the common assumption of inde- 
pendent evolution of each amino acid position [2]. 
Structural models for sequence evolution where function 
is protein-protein intermolecular interaction will now be 
considered. 

To evaluate if a sequence will fold into a unique and 
stable structure, it must be demonstrated that the 
sequence prefers the folded state to both the unfolded 
state (and folding intermediates) as well as to alternative 
folded states [3]. From a practical perspective, it is 
impossible to enumerate the enormous space of all pos- 
sible alternative conformations [4]. Therefore, sets of 
representative "decoy conformations" must be used to 
approximate folding intermediates and/or other stable 
or meta-stable states [5]. Alternatively, without explicitly 
considering decoy structures, the inter-residue contacts 
from different conformations can be randomly sampled 
[6]. Additionally, a scoring function is needed to 
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quantify differences between the possible states. Strate- 
gies developed for generating such scoring functions 
include those derived statistically from existing structures 
(informational models, specifically coarse-grained pair- 
wise statistical potentials) [7-9] and those derived from 
physical first principles [10,11]. Both types of scoring 
function assume that the native sequence lies close to 
(within a neutral sequence network of) a thermodynamic 
optimum, and that there is a gap in energy between the 
native state and the closest non-native state [12]. 

Formally, the protein sequence evolution problem 
relates to the inverse folding problem (whether a 
sequence will preferentially adopt a particular confirma- 
tion) rather than the folding problem. Underlying both 
problems are similar physical assumptions and similar 
models [13]. 

Protein biological function depends not only on 
proper folding, but also on the ability to bind the target 
ligands. Therefore, to study sequence evolution, binding 
must also be evaluated. From a physical perspective this 
is easier, as the only states to consider are the bound 
and the unbound states. However, it has recently been 
suggested that selective pressures on proteins to avoid 
non-specific binding are also an important aspect of 
protein fitness, necessitating consideration of binding 
"decoys" as well [14]. Identifying the actual selective 
pressures against non-specific binding in a cell is a 
daunting task, but the number and nature of binding 
decoys is a major determinant of the ease of evolving 
new binding interactions. The binding decoy character- 
istics also affect the level of selective constraint on the 
binding interface. 

Site and function independent models clearly do not 
capture critical elements of protein evolution. For struc- 
ture-based models, a good scoring function must pro- 
duce sequences with similar properties to real proteins. 
This includes a hydrophobic core that evolves slower 
than the hydrophilic surface [15]. A dN/dS ratio (the 
ratio of the nonsynonymous nucleotide substitution rate 
to the synonymous nucleotide substitution rate) much 
smaller than unity, particularly for functional residues, 
should also emerge [16]. More particularly, amino acid 
substitution rates should show heterogeneity, expressed 
by a gamma distribution across positions (rejecting an 
equal rates model of evolution) [17], reflecting their rela- 
tive importance to folding and function. While the 
gamma function for rate heterogeneity was adopted for 
model fit rather than mechanistic purposes, it is one of 
the most common parameters in standard evolutionary 
models and accounts for heterogeneity in the substitu- 
tion rate driven by structural signals as well as other sig- 
nals in evolutionary sequence data [18,19]. In addition 
to rate heterogeneity, sequences must also have an 
energy gap between the native and alternative 



conformations to ensure rapid and stable folding [20], 
although structurally disordered proteins represent a 
distinct class of proteins that do not have this property 
[21]. The model should place the native sequence near 
an optimum so as not to provide a signal of directional 
selection when function is not changing. Consequently, 
most mutations should be deleterious or nearly neutral 
rather than adaptive [22]. Lastly, proteins must retain 
their binding function. 

Population size dictates what fitness changes are neu- 
tral as well as what mutations become substitutions [23] 
and must therefore be taken into account. To enable use 
in forward (simulation) and backward (phylogenetic or 
population genetic) studies of evolution, especially in a 
population genetic context, in a complete genome con- 
text, or in studies of interactome evolution, the model 
needs to be coarse grained at a level that allows for suffi- 
cient computational speed. In this context the computa- 
tional cost makes state of the art models and methods 
from the protein structure community [24] intractable. 

Here a novel physics-based scoring function is devel- 
oped and compared with a commonly used informational 
approach on the criteria described above. The physics- 
based model is more specific and predicts less drastic sta- 
bility changes. Both types of models violate the assump- 
tion of high native sequence stability and produce many 
adaptive mutations through directional selection towards 
a scoring function optimum, even though the model 
works with truncation selection. These properties are dis- 
cussed in light of their effect on modeling accuracy and 
improvements to the models are suggested. 

Results 

Protein folding and function cause interdependence 
between sites in the protein sequence. For instance, a 
mutation that removes a cysteine involved in a disulfide 
bridge (interaction with another cysteine residue at a 
different position in the protein) is likely to be deleter- 
ious, whereas mutation of other cysteines (not involved 
in disulfide bridge formation) may be more neutral. Of 
course, disulfide bridges can be lost and interconverted 
to other types of stabilizing interactions over evolution- 
ary time, albeit with a large transitional barrier due to 
the strong site interdependence. Such selective pressures 
can be modeled by scoring the thermodynamic effects 
of mutations. Two approaches to calculating such scores 
are the statistically motivated informational methods 
(specifically, coarse-grained pairwise statistical poten- 
tials) and the first principles physics-based methods. 
Informational models score the likelihood of observing 
specific types of contacts in the folded protein based on 
those seen in previously known structures [12]. Physics- 
based approaches evaluate structures by measuring the 
fit of residues to geometrical properties such as 
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backbone torsion and residue-residue distances [10]. In 
both cases the fit to both the native and the many possi- 
ble non-native conformations must be measured to 
ensure specificity. 

This manuscript sets out to evaluate the ability to 
replicate biological properties of proteins using both 
approaches described above through evolutionary simu- 
lation. A new physics-based approach is described. 
Native sequences were evaluated in both approaches to 
evaluate if they fell within the neutral network of an 
optimum or if they underwent directional selection initi- 
ally. Similarly, sequences sampled from within the opti- 
mum were evolved to examine the evolutionary 
properties of each model. With the physics-based 
model, the relative importance of each term in the fold- 
ing and the binding functions was evaluated. Together, 
these analyses give a picture of the evolutionary perfor- 
mance of each model and the appropriateness for use in 
addressing various questions in molecular evolution. 

The Native State in Physics-based and Informational 
Scoring Functions 

In this study a novel physics-based scoring function for a 
coarse-grained representation of protein structure (Figure 
1) was developed. The model (the details of which are 
described in Methods) includes terms for backbone bond 
angles, helix and sheet propensities, a new disulfide 
bridge term, a Lennard-Jones potential to account for 
van der Waals interactions, a Coloumbic potential, and a 
solvation potential. The last three terms were also 
included as part of the binding score function. The newly 
developed physics-based potential energy function 
(described in Methods) for a coarse grained protein 
representation (Figure 1), was evaluated on structure fit 
to sequence (the fold recognition problem) and sequence 
fit to structure (the inverse folding problem). A pre- 
viously developed informational scoring function [9], 
which is widely used for evolutionary studies was simi- 
larly tested. Both models produce an energy gap Zf oid (eq. 
14) between native and random conformations for the 
native sequence for 100 proteins in a structurally diverse 
test set. Also, both models produce larger energy gaps for 
the native sequence than random sequences in the native 
conformation from the same set of proteins: an average 
native Zf otd of 1.7 for the physics-based model and 4.4 for 
the informational model. Hence, both model types solve 
the fold recognition and inverse folding problems to 
some degree. Binding function can be scored similarly 
(but on a separate test set of protein complexes, see 
Methods), and a gap Z bind (eq. 18) is produced for both 
models with an average of 0.78 for the informational 
model and 0.90 for the physics-based model. Not surpris- 
ingly, both models score folding (for which they were 
designed) more specifically than binding. 




Figure 1 The two-bead model of protein structure is 

presented. Each residue consists of a Ca bead (grey) and a C(3 

bead (white). Ca beads are placed at the position of the Ca atoms 

and form the backbone through being connected by virtual 

peptide bonds with a torsion angle 4>. Each Ca bead (except Gly) 

binds a Cf3 bead with a bond length b and bond angle 0. Cp 

beads, whose centers reside at the geometric centroid of the 

residue atoms, are separated by a distance ry and have a radius r, 

(proportional to the radius of gyration of the side chain atoms). 
^ J 

Natural protein sequences are not near the modeled 
thermodynamic minimum 

However, evaluation of folding stability (in the sense of 
Sgapi ec l- 15) of near-native sequences for a representa- 
tive protein SAP [25] shows that neither model is highly 
specific for the native sequence (Figure 2). Within 15% 
divergence from the native sequence, 42% of sequences 
are more stable than the native under the physics-based 
model, and for the informational model this rises to 
73%. This is not consistent with biological protein 
sequences, for which the vast majority of mutations are 
destabilizing [26]. The sampling procedure also uncovers 
sequences up to 2.8 x more stable than the native within 
30% divergence (4.8 x for the informational model). The 
landscapes for both models are qualitatively similar: the 
native sequence sits on a steep slope and there are local 
minima elsewhere (Figure 2). The physics-based func- 
tion yields a somewhat funnel-like landscape, whereas 
the informational model produces a surface with multi- 
ple similar minima. In both cases the overall slope and 
large fraction of adaptive changes would be expected to 
cause directional selection away from the native state. 
Simulations corroborate this by showing rapid diver- 
gence from the native state for both models (Figure 3). 

Sequences near the modeled thermodynamic minimum 
have protein-like properties 

As the native sequence of SAP is not near a stable equi- 
librium for folding stability on the landscape, stability- 
biased Markov Chain Monte Carlo (MCMC) sampling 
(see Methods) was used to find such minima. As seen in 
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Figure 2 Folding scores for near-native sequences threaded through the conformation of the SAP protein are shown. The score 
distribution is shown as a landscape, with the percentage of the negative native score (i.e -S gop naWe , see eq. 15) on the Z axis and coordinates 
in high-dimensional sequence space projected onto the X and Y axes via Sammon mapping. Smaller scores denote more stable sequences (in 
blue), with larger scores being less stable (becoming progressively more red). The native sequence is marked by a black dot. A) The score 
distribution for the informational scoring function. B) The score distribution for the physics-based scoring function. 

V J 



Figure 4 (left), these highly stable sequences are very 
dissimilar to the native state, with only 5% and 11% 
average identity for the physics-based and informational 
models, respectively. However, both models show high 
conservation at many positions and a marked signal for 
hydrophobicity. The sequences show a rugged energy 
landscape across sequence space and expectedly com- 
plex patterns of change relative to the native sequence 
(Figure 2). 

In the informational model the sequences are domi- 
nated by stretches of hydrophobic residues, interspersed 
with charged residues and a few tyrosines (Figure 4A, 
left). The remaining positions are highly randomized 
and do not maintain a specific biochemical character. 
When mapping the most informative positions onto the 
structure (Figure 4A, right) a fairly simple pattern 
emerges. The P -sheet core is dominated by small hydro- 
phobic residues, surrounded by a shell of larger such 
residues with the occasional tyrosine embedded. Clus- 
ters of charged residues maintain the stability of some 
loops, although most of the highly exposed positions are 
randomized. 

The physics-based model produces qualitatively simi- 
lar results, but differs in some aspects. Core residues are 
prolines rather than leucines, and the larger hydropho- 
bic residues surrounding the core tend to be trypto- 
phans instead of phenylalanines (Figure 4B, left). 
Tyrosines are again favored in semi-exposed positions. 
However, in place of ionic interactions the physics- 
based model prefers glycines. Mapping the informative 
positions onto the structure (Figure 4B, right) reveals 
that these conserved glycines are located in or near 



portions of the backbone that are p -sheet-like and 
highly exposed. Another notable difference is that the 
physics-based model seems more residue-specific as it 
conserves fewer positions overall (36 vs 54), but those 
residues are typically more strongly conserved with lar- 
ger co-evolutionary barriers to change. Both models 
impart roughly the same selective pressure overall. 

Evolutionary simulations near minima reproduce 
biological sequence properties 

Using the stable sequences as starting points for popula- 
tion-based simulations allows comparison of the models 
with respect to other properties than folding stability. 
As can be seen in Figure 5, evolutionary rate ratios in 
both cases are comparable to those found in real ortho- 
logous proteins (dN/dS of 0.1-0.5) and indicative of the 
negative selective pressure applied (dN/dS < 1)[16]. The 
rates also vary as expected depending on the position of 
each residue in the 3D structure. Exposed surface posi- 
tions evolve up to three times faster than the buried and 
highly constrained core residues. Residues in the binding 
site, which are exposed but under functional selective 
pressure, show an intermediate rate. Interestingly, under 
both models the rates in the selectively important 
regions (core and binding site) become very similar over 
time. The physics-based model consistently shows lower 
rates throughout the structure, with the exception of 
initial changes in the binding site. Surface mutations are 
noticeably more selected against compared to the infor- 
mation-based approach, which is consistent with the 
explicit consideration of solvation in this model. The 
physics-based model also restricts movement on the 
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Figure 3 Sequence divergence of the SAP protein over the 
simulation time when starting from the native state for the 
informational (A) and physics-based (B) models. 



sequence-stability landscape to one or a few well-defined 
(but very large in sequence space) basins, whereas the 
informational model appears to diffuse more neutrally 
across a mostly flat surface (Figure 6). 

Real protein sequences also show rate heterogeneity 
across positions (which is modeled by a gamma distribu- 
tion of rates of amino acid substitution per site in statis- 
tical comparison with equal rates across sites), such that 



most positions evolve slowly but some evolve quite 
rapidly [17]. The fit of such a model to the simulated 
sequences was found to be reasonable (Table 1). The 
relative support for including gamma-distributed rates is 
generally high using the informational model, and some- 
what lower in the physics-based model. The latter 
instead shows higher relative support for a model with 
some invariant sites among common parameters in site- 
independent substitution models. The drop in support 
for rate heterogeneity toward the end of the simulations 
may indicate emergence of rate heterotachy [27], caused 
by migration to a non-native stable basin through intra- 
molecular co-evolution. This is consistent with a view of 
proteins where an initial equal rates across sites model 
with a small number of substitutions gives rise to a 
gamma distribution as substiutions accumulate, driven 
by the three dimensional structure. With further substi- 
tution, intramolecular co-evolution gives rise to support 
for a covarion process rather than for a gamma distribu- 
tion [28]. When gamma is supported, the shape para- 
meter of the gamma distribution (0.1-3.0) is consistent 
with those found in biological sequences in general [29] 
and with the shape parameter 1.83 obtained from the 
Pfam SH2 domain seed alignment in particular. 

The sequences and structures produced by each simu- 
lation (Figure 7) further highlight the differences 
between the models. The sequence collection from the 
informational model with truncation selection on fold- 
ing and binding (Figure 7A) shows the same basic fea- 
tures as those found by sampling from stable structures 
(Figure 4A, left) but with added emphasis on tyrosines 
rather than ionic interactions. There is some indication 
of increased conservation of the binding site (around 
positions 52, 68, and 92). During the simulations core 
hydrophobicity increased by -25%, and surface hydro- 
phobicity almost tripled (Table 1). This resulted in a 
similar ratio of core and surface hydrophobicities as in 
the native SAP sequence (1.63 and 1.80, respectively). 
By mapping conserved positions onto the structure (Fig- 
ure 7A, right) we recover the starting pattern (Figure 
4A, right) of conserved P -sheet core, a shell of larger 
hydrophobic residues around this, and tyrosines at med- 
ium exposure. 

As noted above, the physics-based model sampled 
with truncation selection for folding and binding is sub- 
stantially more restrictive in how many non-synon- 
ymous substitutions it allows, and the simulated 
sequences reflect this (Figure 7C, left). Nearly 1/3 of the 
positions are effectively invariant over shorter evolution- 
ary timeframes, compared to only a few residues in the 
informational model. Structural mapping of highly con- 
served positions (Figure 7C, right) again recovers the 
starting pattern (Figure 4B, right), although a few P- 
sheet-associated glycines have been replaced by charges 
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Figure 4 Highly stable sequences and structures for the SAP protein conformation for the informational (A) and physics-based (B) 
scoring functions. Residue frequency distribution in such sequences is shown as a sequence logo (left), and positions with more than 2 bits of 
Shannon information are mapped onto the structure of SAP with equivalent colors (right, non-informative positions shown in white). 



on the most exposed loops. The binding site is very 
highly conserved. Only very exposed positions (17-18, 
59-62, etc.) have started to randomize, mostly contain- 
ing polar and charged residues. Core and surface hydro- 
phobicities started close to those of the native sequence 
(83% and 50% vs 72% and 40%), and did not change 
appreciably during the simulation (Table 1). 

Comparing the simulated sequences to the Hidden 
Markov Model of the Pfam [30] SH2 family (Figure 7B), 
some similarities are noticeable. Both models reproduce 
the overall conservation of the core residues, particularly 
the very buried ones. The informational model is more 
accurate in predicting the residue type within the core 
(leucines and phenylalanines), whereas the physics-based 
model correctly conserves some glycines. Neither model 
captures the defining functional feature of the family (a 
conserved arginine in the binding pocket), but this is 
due to the SAP SH2 domain binding a non-phosphory- 
lated ligand. Instead the models more generally preserve 
residues within the binding interface. This type of con- 
servation is absent in the HMM since SH2 domains 
have unique specificity-determining binding site geome- 
tries. Similarly, since the HMM reflects several dozen 
distinct structures and sequences it is less conserved 
overall than simulated sequences derived from a single 




Surface 
Core 

Binding site 




Figure 5 dN/dS vs dS (proportional to simulation time) for 
surface (black), core (red) and binding site (green) residues is 
shown. Values reflect the mean of the most common allele in each 
population, with standard error bars indicating inter-population 
variation. Simulations using the informational scoring function in A), 
simulations using the physics-based scoring function in B). 
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Figure 6 Sequence divergence over the simulation time when 
starting from a stable state the SAP protein structure is shown 
for the A) Informational model and the B) Physics-based 
model. 



structure evolved over relatively short evolutionary 
distances. 



described elsewhere [9]. Here, the nature of the physics- 
based model is examined in more detail. The term 
weights are re-adjusted for each individual protein in 
the test set, and the resulting distributions are shown in 
Figure 8A. Solvation, side chain angles (bend) and pair- 
wise interactions (LJ) are quite important, consistent 
with a well-packed hydrophobic core with biological 
side chain rotamers and no steric clashes. Buried salt 
bridges (ion) and disulfide bonds (S-S) seem less influ- 
ential, but are also comparatively rare in the investigated 
protein structures. Of the secondary structure terms, 
helix propensity is somewhat more specific than p -sheet 
tendency. For binding, shape complementarity (bLJ) is 
by far the most effective term. However, these para- 
meter values are found in the context of the complete 
scoring function and may have strong synergistic effects 
in combination with other terms. An examination of the 
contribution of each term in isolation to the total gap 
between native and random states (Figure 8B) reveals 
some additional trends. The LJ term is nearly uninfor- 
mative on its own, indicating that it needs to be com- 
bined with at least one other term to be specific. The 
difference between representations of the non-native 
state is visible in the helix and beta terms. The helix 
term is more informative when considering the gap to 
decoy structures, whereas the beta term is more effec- 
tive in discriminating against random sequences. 

Repeating the parameterization step many times on 
the SAP protein reveals further differences between the 
terms. Although the algorithm converges on roughly the 
same optimality score (it is numerically stable), there is 
substantial variation in parameter values. The co-depen- 
dence of roughly equally likely solution sets for term 
weights is shown in Figure 9. Some terms (bend and 
solvation) have pronounced optima around which there 
is little variation, but others (ionic, LJ, helix) produce a 
relatively flat optimality landscape. Particularly the less 
important terms have weights that vary over several 
orders of magnitude. This can cause difficulty with find- 
ing realistic stable sequences for a particular fold, as 
small changes in some terms (notably the ionic term) 
can have large effects on the sequence composition. A 
representative sample of parameter sets was used to 
generate collections of stable sequences, and the most 
protein-like set was selected (Figure 9, filled squares). 
The parameter values fall close to the center of each 
distribution, although charges are de-emphasized and 
solvation has increased importance. 



Components of the physics-based model have variable 
importance for scoring 

The parameters of both models would be expected to 
have an influence on the sequences produced. The 
properties of the informational model have been 



Discussion 

In evaluating our ability to model the evolution of 
sequences using structural and functional (binding) con- 
straints, both informational and physical models show 
protein-like properties, but also need further 
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Table 1 The protein-like properties of sequence evolution over simulation time is shown 



Simulation time 


Model 
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Hydrophobicity 
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Hydrophobicity 

(%) 


Gamma Support 
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Informational 
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Informational 
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20.4 


0.69 
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1 00000 


Informational 


47.4 
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2.07 


1 50000 


Informational 
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2.61 
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Physics-based 


83.4 


49.2 


0.20 


3.31 


25000 


Physics-based 


82.8 


49.5 


0.37 


2.58 


50000 


Physics-based 


81.7 


49.1 


0.31 


2.06 


70000 


Physics-based 


81.7 


48.4 


0.61 


1.86 


100000 


Physics-based 


81.1 


48.6 


0.09 


1.87 


150000 


Physics-based 


81.6 


48.2 


0.25 


1.76 


200000 


Physics-based 


81.1 


48.0 


0.10 


1.88 



Simulation time is listed in generations. Core and surface hydrophobicities shown reflect the population average percentage of hydrophobic residues in buried 
and exposed positions. Gamma support indicates the fractional support for a model including a gamma distribution of rates, and the shape parameter shows the 
a value of such a distribution. 



improvement. Despite a stability gap between random 
sequences (and structures) from the native for both 
methods, the native sequence is not near enough to a 
thermodynamic stability optimum to prevent directional 
selection. Sequences near such an optimum are more 
homopolymeric in composition. The evolution of such 
sequences, however, does show many attributes of real 
protein evolution. 

The effect of negative (purifying) selection, a dN/dS 
ratio smaller than 1, is well modeled by both 
approaches, even reproducing the known correlation 
between exposure and evolutionary rate [31]. The phy- 
sics-based model produces the stronger selective pres- 
sure overall, particularly so on the protein surface (due 
to the solvation term). Variation in this rate across posi- 
tions is mimicked by the informational model, which 
more consistently supports a gamma parameter 
throughout the simulation. The physics-based model 
instead conserves some positions completely over 
shorter evolutionary distances, but ultimately supports a 
covarion process rather than simply rate heterogeneity. 
In particular, hydrophobic residue content is much clo- 
ser to that of the native sequence. When it comes to the 
exact identity of such hydrophobic residues (e.g. leucines 
vs prolines) the informational model has higher corre- 
spondence to native residues, presumably due to being 
derived from known sequence/structure combinations. 
Overall the physics-based model appears more specific 
for local but crucial contributions to free energy of fold- 
ing, whereas the informational model produces 



sequences with protein-like properties but without fold 
specificity. 

In the context of fold specificity, a major difference 
between models is consideration of secondary structure. 
The physics-based model is enriched in prolines in the 
(3 -sheet core, and glycines in other sheet-like portions of 
the backbone. In principle any small hydrophobic resi- 
due should suffice in the tightly packed and highly bur- 
ied core positions, but prolines in particular score well 
in the helical term (being helix breakers in the most 
non-helical segment of the protein) [32] and are mostly 
acceptable in the sheet term. The glycines in half- 
exposed positions with sheet-like geometry provide an 
excellent compromise between the helix, sheet and sol- 
vation terms, and have little effect on the remainder of 
the scoring function. Although somewhat unrealistic in 
the context of the native sequence, this does demon- 
strate the strong impact of the particular backbone con- 
formation under selection. 

The physics-based model is also different in that its 
parameters are adjusted for each protein. This is benefi- 
cial in adding specificity to the model [33], but also pre- 
sents issues with parameter selection [34]. The Lennard- 
Jones term in particular is surprisingly variable and 
offers little benefit in isolation. When testing against 
random sequences, this is mostly due to the side chain 
replacement protocol. It minimizes a Lennard-Jones 
function [35], thereby removing much of the signal a 
priori. Side chain optimization with a complete scoring 
function including other terms might remedy this 
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Figure 7 A comparison of simulated sequences and structures with those of known SH2 proteins is shown. A) and C) depict sequences 
simulated under the informational and physics-based models, respectively, while B) depicts the SH2 family from the Pfam database. Residue 
frequency distributions are shown on the left as sequence logos for A) and C), and as HMM emission probabilities for B). Red boxes indicate 
matching positions between the sequences. For A) and C), positions with more than 2 and 3 bits of information, respectively, were mapped 
onto the SH2 structure on the right with equivalent colors (white beads indicate less informative positions). For B), the 25 most probable 
positions were colored black on the backbone (remaining positions again in white). 



problem. The vastly larger utility in a binding context 
(Figure 8A), where no such pre-minimization occurs, 
further illustrates this point. The test against random 
compact structures with constant sequence (Figure 8B) 
is more illuminating. It shows that the L-J term needs 
the restrictions imposed by other factors (for example, 
solvation and secondary structure) to efficiently identify 
a protein-like heteropolymer. In other words, it adds 
information to the total function, but only when the 
solution space is already somewhat restricted. 

Generally, the parameterization of both folding and 
binding scoring functions is fairly difficult. For instance, 
the choice of the reference state (the unfolded or 



unbound molecules) has a substantial impact on the cal- 
culations. The ruggedness of the optimality landscape in 
parameter space makes it quite difficult to find the glo- 
bal optimum, and parameterizing folding and binding 
functions separately may lead to inconsistencies between 
the two. Future work will explore more sophisticated 
approaches to solving these problems. 

The specificity of either model is ultimately reflected 
by the shape of its stability landscape. As seen both near 
the native sequence (Figure 2) and non-native optima 
(Figure 6), the physics-based model produces a land- 
scape that is more rugged and has fewer equivalent 
minima. This is consistent with theoretical expectations 
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Figure 8 The importance of terms in the physics-based scoring 
function is shown. A) The distributions of relative weights for each 
term in the full scoring function on proteins/complexes in the test 
sets (binding terms prefixed with "b") are depicted. Outliers are 
shown as empty circles. B) The gap between the native score and 
the distribution of random samples when using only single terms in 
the scoring function is depicted. Gaps are shown for the set of 
random sequences (dark grey), the set of random geometries (light 
grey) and the combined set (black). Note that only folding score 
terms are shown. 
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Figure 9 Parameter sets of approximately equally likely term 
weights for the SAP protein are shown Each set is represented 
as a colored line, and the best set is shown as filled red squares. 
The Cys-Cys term was not included because of the lack of disulfide 
bridges in the SAP protein, although there is potentially limited 
information in that term in contrasting with alternative 
conformations with disulfide bridges. 



of the real free energy landscape in both conformational 
[3] and sequence [36] space. The result is that the simu- 
lated sequences tend to be more conserved, stay near 
the starting ("native") point during simulation, and show 
mostly deleterious or neutral mutations. In contrast, 
under the informational model populations simply dif- 
fuse across a relatively flat sequence landscape with less 
co-evolutionary pressure between interacting sites, 
rapidly moving away from the starting point. In this 
aspect the physics-based model shows more protein-like 
and fold-specific behavior overall, despite selecting 
sequences with some compositional "quirks", as dis- 
cussed above. Informational methods, especially those 
built only on pairwise contacts, necessarily lack specifi- 
city. For example, all Lys-Phe interactions are not 
equivalent. Those oriented in a plane will be repulsive, 
while those that are orthogonal will generate a cation-pi 
effect and be attractive. This is but one example. Physi- 
cal models, at the right level of granularity, will not suf- 
fer from this problem and are thereby inherently 
attractive for this interaction specificity. Further, they 
enable study of the forces driving evolutionary processes 
at a physical level. 

Why then are the native states so far from sequence 
optima? The answer is found in the approximations 
made by the models. Packing is perhaps the most 
severely affected, as can be seen from the accumulation 
of large hydrophobic residues in both models. These 
substitutions fill up the empty space that results from 
the coarse-graining procedure. Spherical CP beads, for 
instance, are a particularly poor approximation for elon- 
gated (Lys, Arg, etc.) or flattened (Trp, Phe, etc.) resi- 
dues. A structural model with more than one bead per 
side chain, such as that of Hills and co-workers [37], 
could remedy this issue. The representation of ions as 
spherical point charges is also problematic, and salt 
bridge formation could be better addressed by using a 
model of induced dipoles [38]. Many entropic contribu- 
tions are not considered, although the physics-based 
model includes some of them implicitly in the solvation 
term. In addition to entropic considerations, explicit 
hydrogen bonding [39], cation-pi interactions [40], and 
other interactions are not considered, but would 
increase model complexity. Finally, an overarching pro- 
blem may be found in the effect of individual mutations. 
A typical disadvantageous mutation under the informa- 
tional model has a very small effect on the stability 
score E(s, c) (eq. 12) (< 0.5 kT units out of tens of kT 
units, where k is Boltzmann's constant and T is the 
absolute temperature), and the physics-based model 
generally predicts even smaller stability changes. In real 
proteins the average mutation is strongly deleterious (3 
kcal/mol out of a total AG of -10 kcal/mol, where AG 
is the free energy difference between the unfolded and 
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folded protein structure) [41], indicating that the models 
are much too lenient when it comes to the cost of 
mutation. Appropriate weighting of contributing terms 
or more precise modeling of packing would improve 
this aspect, and make the energy landscape more realis- 
tically rugged. 

Early attempts at considering alternative states in the 
above models of folding used explicit decoys. Because of 
the large number of potential alternative states for a 
sequence, both kinetic traps and thermodynamic optima, 
evaluating even a large finite number of such states 
proved inadequate. Further, identifying the states that 
were most informative near the native confirmation was 
not clearly driven by fold sequence or geometry and it 
may be that the relevant structures originate from smal- 
ler regions of diverse structures. The random contacts 
model [6] proved powerful and efficient in sampling 
many contacts from diverse structures as an implicit 
solution. 

For binding, there is a greater conundrum in that 
while the real biological decoys are not known, a ran- 
dom contacts model has the potential to be too restric- 
tive in selecting against binding interactions that are not 
biologically deleterious. For an alternative binding inter- 
action to be deleterious, the proteins need to have a 
deleterious interaction (as defined by biological fitness) 
and to have the potential to interact at the same time, 
in the same cellular location, at the right concentration 
[42]. However, the proper decoys are frequently not 
known and this restricts the use of the alternative 
approach of explicit binding decoys. 

Further, when evaluating binding specificity as a func- 
tion of affinity it is assumed that complex thermody- 
namic stability (and relatedly, life time) must be 
maximized. While this is a relevant constraint for cellu- 
lar structures such as the nuclear pore complex [43], 
protein complexes involved in signaling or even in form- 
ing the cytoskeleton have transient associations that are 
necessary for functional signaling [44-46]. A require- 
ment of transient binding means that dissociation 
kinetics are selected to be just fast enough, not as slow 
(corresponding to very stable complexes) as possible. 
Therefore, even in the limit of perfect predictions one 
should not expect to recover the native binding site 
sequence based on thermodynamics alone. 

From the perspective of protein structural space [47], 
the folding decoys all lie near the native conformation. 
The total amount of secondary structure is conserved, 
as are the relative percentages of helices and sheets. 
This means that the decoys can be thought of as being 
sampled from the same Superfamily, or at least Fold, in 
the SCOP hierarchy [48]. Conservation of residue expo- 
sures and sequence length also guarantees that the con- 
formations have roughly the same radius of gyration as 



the native state. This should cause selection against 
subtle misfolds and accumulating pathway intermedi- 
ates, but may be a poor representation of the diversity 
of conformations a sequence may fold into and thus 
may miss important local alternative conformations that 
are close in energy. For example, an increase in residues 
with high helical propensity could shift the lowest- 
energy conformation toward an all-a topology instead of 
a+(3. The current approach does not allow modeling 
this type of event, but evaluation of the partition func- 
tion of the Boltzmann probability of folding is a notor- 
iously difficult (and currently unsolved) problem [49]. A 
critical aspect of characterizing the energy landscape is 
the role of decoy structures in restricting the funnel 
shape as well as providing fold specificity, and this is a 
challenging task. 

Recently, the utility of site-interdependent structural 
models of sequence evolution was evaluated in the con- 
text of phylogenetics [50-52]. Despite use of a variety of 
scoring functions and probability measures, the utility of 
such models was found to be generally low. Minimizing 
a coarse-grained potential failed to recover much of the 
native sequence and did not outperform site-indepen- 
dent models [51]. Simulating the effects of substitutions 
on folding energy revealed a rapid divergence from the 
native state even when modeling solvation in addition to 
residue-residue contacts [50], and including the contri- 
bution of folding energy to transition path probability 
can even decrease tree reconstruction accuracy [52]. 
These results are all consistent with the nature of the 
coarse-grained energy landscape seen above. There is no 
global minimum near the native sequence, and its posi- 
tion on a steep slope results in rapid divergence in 
terms of sequence, energy or both. Models that accu- 
rately characterize the structural effects of substitutions 
are urgently needed to make progress in this direction. 
It is also worth noting that these approaches all relied 
on informational potentials, which we have shown above 
to be less structurally specific than physics-based 
scoring. 

With models that properly characterize the evolution 
of proteins, many important evolutionary biological 
questions can be addressed. What are the patterns of 
sequence evolution associated with different mechan- 
isms of duplicate gene retention? Relatedly, how easily 
does orthologous neofunctionalization occur and how 
dependent upon protein fold and binding interface size 
is it? What are the roles of positive and negative pleio- 
tropy in restricting neofunctionalization? How do struc- 
tural transitions occur between neighboring folds? At 
the contact level, how are different stabilizing interac- 
tions interconverted (for example, the transition 
between Cys-Cys, cation-pi, and Coulombic interac- 
tions)? What is the interplay between population size, 
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fold distribution, and functional evolution? To address 
these important evolutionary biology questions, contin- 
ued progress on models such as described here is 
needed. 

Conclusions 

Proteins evolve according to the laws of physical chem- 
istry, biochemistry, and population genetics. Classic phe- 
nomenological site independent models, while 
computationally simple to use, do not adequately 
describe evolutionary processes for either phylogenetic 
or functional evolutionary analysis (comparative geno- 
mic) purposes. Mechanistic models built upon the type 
described here will be necessary. Biological (for example, 
regulatory or metabolic) pathways dictate the level of 
constraint on physical constants and such considerations 
will also need to be taken. While the currently available 
models remain inadequate, an understanding of the pro- 
blems in physical chemistry associated with evolutionary 
models will lead to future improvements, with many 
downstream applications. 

Methods 

The two-bead model 

The protein structure is represented by a reduced two- 
bead model originally described by Levitt [53] and more 
recently used by Mukherjee and Bagchi [54] as well as 
Grahnen et al. [35]. Each residue is represented by one 
backbone bead (CaO and one side chain bead (Cpj (Fig- 
ure 1). The Ca bead is centered on the Ca atom in an 
all-atom representation and has a radius of 1.8 A. For a 
residue i (except glycine), the CPi bead center is placed 
at a distance b t from Coii, which is determined as the 
centroid of all the side chain atoms (including the Co^ 
atom), and assigning a residue-dependent radius r t . Gly- 
cine is simply represented with a Ca bead with the 
appropriate properties (hydrophobicity etc.). For residue 
i, where 2 < = i < - N-2 in a protein with N residues, 
we define 0 t as the angle between Ca^, Coii and C(3i, 
and fa as the torsion angle between Ca^, Ca^, Ca i+1 
and Ca i+2 . Residues within two bonds of the termini do 
not have a defined fa, and the residues at the N-termi- 
nus do not have a defined Q t , Finally, r t j describes the 
distance between the center of any two beads i and ; in 
the model. In the informational representation of pro- 
tein structure r t j was measured by taking the minimum 
distance between the CP beads in residues i and j (Ca is 
substituted for CP for glycines). 

Threading sequences into conformations 

To thread a sequence s into a protein backbone confor- 
mation c, the side chain replacement algorithm SARA 
[35] was used. SARA is a very fast approach based on 



iterative Markov Chain Monte Carlo refinement of the 
replaced side chain geometry. Backbone coordinates 
were not adjusted during the replacement procedure, 
and c was assumed to be constant throughout. 

Scoring protein folding with the physics-based model 

To evaluate the fit of a sequence s to a backbone con- 
formation c, where the number of residues N is the 
same for both s and c, s is threaded into c (see above) 
and then a weighted scoring function V(s, c) is com- 
puted: 

V(s, c) = w bend V bend + w LJ V LJ + w heUx V hd i X + w beta V beta + w ion V ion + w so i v V solv + w S -sV S -s (l) 

where: 

Vbend 1S a harmonic bending potential around the 
equilibrium bond angles for Ca-CP bonds: 

N N-1 

V bend = (l/2)JC e J2 (©< - ©o (0) 2 + (l/2)*e E (®w - 0 o W) 2 (2) 

i=2 i=l 

K@ is the force constant for the bending potential 
(10.0 kj mol" 1 rad" 2 ), 0 t and are defined as 

described above, and 0q is the equilibrium bond angle 
for a CP bead of type t. 

V LJ is the pair-wise vdW interaction potential, 
approximated as a sum of Lennard- Jones potentials: 

*-«i>i(?r-(?)'i 

Sij is the interaction parameter between beads i and j 
(a pair of Ca beads, a pair of one Ca bead and one Cp 
bead or a pair of two Cp beads), based on their respec- 
tive hydropathy indexes, o^- is the collision diameter of 
beads i and j (i.e. the sum of their radii), and r t j is the 
defined in Figure 1. The sum /; runs over all pairs of 
beads noted above. 

y helix 1S the helical potential, an approximation of the 
entropic and steric effects of the backbone forming an 
a -helix: 

Vm„ = [iK/" 3 (r iM2 -r h f + iK/" 4 (r I>3 - r h f] (4) 

Ki 1 ' 3 is average helical propensity of residues i, i+1 
and i+2. r i} i+2 is the distance between Ca beads i and i 
+2. r i} i+3 is the distance between Ca beads i and i+3. 

4 is the average helical propensity of residues i, i+1, 
i+2, and i+3. Finally, r h is the equilibrium helix inter- 
bead distance (5.5 A). 

Vbeta 1S the beta-sheet potential, constructed analo- 
gously to V helix Du t using an equilibrium beta torsion 
angle instead of a bead-bead distance: 
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N-2 

Vbeta = Kl~ 4 J2 ( C b(<Pi ~ n)f (5) 

i=2 

4 is the average beta propensity of residues z-1, i, i 
+i, and i+2. The beta propensity for a given residue 
type is constructed using the same linear scaling of heli- 
cal propensities used by Mukherjee, but is based on Kim 
and Berg's beta propensity scale [55] rather than Pace 
and Scholtz's helix propensity scale [32]. Q, is a scaling 
constant (0.01) selected to scale the range of torsion 
angles to a similar magnitude as the range of bead-bead 
distances in Vheiix- Torsional angle fa is defined above 
(Figure 1), and fa is the equilibrium beta sheet value 
(210°) for a two-bead representation as described by 
Bahar and coworkers [56]. 

V ion is an electrostatic potential based on Coulomb's 
Law: 

V - = C c2.— (6) 

C c is a scaling constant (1,000) selected to scale the 
term to similar magnitude as other terms, q t and qj are 
the charges of residues i and / (+1 or -1 depending on 
residue type), r t j is the distance between beads i and y, 
and s is the dielectric constant of the protein interior 
(3.0) [57], with the sum running over all pairs of charged 
residues with a SASA (see below) of less than 0.25. This 
roughly approximates the strong screening of charged 
interactions due to the highly polarizable surrounding 
water, and the nearly negligible effects of the largely 
non-polarizable interior of the protein [58]. The whole 
term is then scaled by a weighting term w ion when used 
in V(s, c) (eq. 1), so C c has no effect on the parameteri- 
zation, but is necessary computationally so as not to 
introduce errors due to lack of floating point precision. 

V so i v is the potential due to solvation of the protein, 
approximated by a novel implicit solvent model based 
on solvent-accessible surface area (SASA): 

N 

Vsoiv = ^ SASA (0 + Pi (1 - SASA(i)) (7) 

i=i 

hi is the hydrophobicity-based interaction parameter 
of residue i (equivalent to e u above) and p t is the "polar- 
ity index", constructed as p t = (h max -h min ) - h t . SASA(i) 
is the fraction of solvent-accessible surface area of resi- 
due i calculated by the NeighborVector method of Dur- 
ham [59] as adapted for the two-bead representation. 

V S -s represents the stability contribution due to forma- 
tion of predicted disulfide bonds: 



y ss = J2f( ri 'j) (8) 
where 

/ ^ = { 0,' otherwise (9) 

and Tij is the distance between two cysteine CP beads 
and r S s is the maximal Cp-Cp distance in a typical disul- 
fide-bonded cysteine pair (4.5 A[60]). The sum /; runs 
over all pairs of cysteine residues. This form of V cys pre- 
dicts disulfide bonds with a specificity of 0.98, sensitivity 
of 0.91 and Matthews Correlation Coefficient of 0.79 as 
measured on the structural data set used to construct 
SCWRL 3.0 [61]. 

Finally, w x are weights for each individual term in eq. 
1, determined in a procedure described below. 

Scoring protein-protein interactions with the physics- 
based model 

When using protein-protein interaction as a measure of 
protein function, it becomes necessary to evaluate the 
strength of the interaction, a proxy for the free energy 
change of binding or the dissociation constant. The 
non-covalent terms from the folding scoring function 
were adapted to compose an interaction score V(s h c h 
s 2 , c 2 ) between a sequence s 2 in conformation c 2 and a 
sequence s 2 in conformation c 2 : 

V(SiCiS 2 C 2 ) = U>LJ>Vlj + Wion, V'ion ~ ™AsoIvVasoIv (10) 

y'u and V- on are the same as above but evaluated for 
inter- rather than intra-molecular bead pairs. 

VasoIv is the change in solvation potential upon bind- 
ing: 

VasoIv = V so iv{siCi) + V S0 i v (s 2 c 2 ) - Vsoivicomplex) (11) 

where complex is the bound state. The weights w x are 
determined as described below. 

Scoring protein folding and interactions with the 
knowledge-based model 

The approach of Bastolla and co-workers [9] to was 
used to score folding under a knowledge-based model, 
and adapted to scoring protein-protein interactions. 
Briefly, a sequence s in conformation c has a folding 
score of 

E ( s > c) = Y Jli Uab(i,))C(i,j) (12) 

where U a b is a matrix describing the free energy gain 
when amino acids of type a and b are in contact, C is 
the contact map of c (1 when the CP beads of i and j 
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are closer than 4.5 A, 0 otherwise) and the sum runs 
over all residue pairs (i, j) that are separated by more 
than four residues in primary sequence. 

To evaluate a protein-protein interaction, a protein 
complex score is calculated as 



E(si, a, s 2 , c 2 ) = ^2 Uab(i,j)C(i,j) 



(13) 



where C is evaluated between c x and c 2 (inter-molecu- 
lar ly), and the sum runs over all residue pairs (i, j) such 
that i comes from c 1 and j comes from c 2 . 

Calculating folding and binding specificity 

Folding and binding specificity were evaluated with sev- 
eral considerations. For example, with folding, a given 
sequence should have the specified backbone rather 
than an alternative as its most stable state and also pre- 
sent a gap between this state and unfolded or misfolded 
states. The size of this gap can be measured as a Z- 
score [62-64]: 



Zfoid 



(g) - G n 
cr(G) 



(14) 



where G is the distribution of free energies (AG) in 
the alternative conformations, G nat is the free energy of 
folding into the native conformation, and <G> and o(G) 
describe the location and dispersal of that distribution. 
In other words, Zf oid can be interpreted as measuring 
the specificity of a protein sequence for its native struc- 
ture, or equivalently a combination of unfolding and 
misfolding stability as well as the preference for the spe- 
cified fold over alternative folds. 

During simulation and sequence sampling, G nat is 
approximated by V(s, c) (eq. 1) or E(s, c) (eq. 12) for the 
current sequence and a native conformation, and G by 
calculation of the same measure for 100 random decoy 
conformations. The decoy conformations were gener- 
ated by independently randomizing each type of geome- 
trical measurement in the native structure, in 
accordance with the Random Energy Model (REM) of 
Bryngelson and Wolynes [6]. For the informational 
model this means randomizing the residue-residue con- 
tacts, whereas in the physics-based model the angles, 
distances, etc. involved in each term are shuffled. The 
fold score gap is then calculated as 



Jgap 



(G) - G nai = v{G).Zf 0 id 



(15) 



where <G> is estimated by the median of G. The dis- 
persal o(G) is assumed to be constant. This assumption 
is computationally efficient, which is a major considera- 
tion when evaluating tens of millions of sequences. Also, 
since the dispersal is purely dependent on the amino 
acid composition under the REM [65], and the gross 



features of the composition (the proportion of hydro- 
phobic residues for example) should not change when 
simulating biologically realistic protein sequences, this 
enables the approximation Zj 



'fold 



The case of specific protein-protein interaction is sim- 
pler due to the lower number of states available. It was 
previously shown that specific binding can be adequately 
modeled as a combination of selection for the native 
ligand and against a single non-specific decoy ligand 
[14]. Here this approach is adopted during simulations, 
which are described in more detail below. 

Parameterizing the physics-based scoring functions 

It has previously been shown that no single parameteri- 
zation of an energy function fits all proteins optimally 
[34,33]. Therefore the individual weights in the scoring 
function were changed for each protein under investiga- 
tion to maximize Zf oid (maximize the specificity of 
sequence-structure fit in the context of alternative 
states). The parameter space has many dimensions, is 
expected to be rugged, and likely contains many local 
maxima. In order to sample possible parameter values 
efficiently, and avoid becoming trapped in the course of 
the search procedure, the Metropolis-Hastings Markov 
Chain Monte Carlo (MCMC) algorithm [66,67] was 
used. Acceptance probability for a move in parameter 
space was based on Zf 0 i d : 

II if Zf 0 id(0k) > Zf 0 i d (O k -i) 

ZjbidiOk) - Zjbid{0k-i) +l . (16) 

e- otherwise 
T 

where 6 k is the set of weights in V(s, c) (eq. 1) for step 
k of the Markov chain, and T is the temperature of the 
chain. 6 k is proposed by perturbing q k _ 1 in each dimen- 
sion with values drawn from a Gaussian distribution 
with mean 0.1 and variance 0.1. The search space was 
restricted to (0,1) for all weights, and 100,000 moves 
were attempted at T = 0.1. The parameter set with the 
largest Zf oid found during sampling was retained as the 
best set of weights. 

To obtain the distribution of non-native scores G 
necessary to calculate Zf oid in this procedure, a two- 
pronged approach was taken. The goal was to find a set 
of weights that makes the native sequence-structure 
combination as specific as possible, both with respect to 
the fold recognition problem (which conformation does 
a fixed sequence fold into?) and the inverse folding pro- 
blem (which sequence best fits a fixed conformation?). 
Because random sequences may fold into alternative 
structures, the random sequences also play a role in 
parameterizing the fold recognition problem, where hav- 
ing a hydrophobic core and preferring a given backbone 
to an alternative state is not enough. To address 



Grahnen et al. BMC Evolutionary Biology 201 1, 1 1:361 
http://www.biomedcentral.eom/1471-2148/1 1/361 



Page 15 of 18 



specificity of conformation, 1,000 decoy conformations 
were generated as described above to obtain the distri- 
bution of scores G struct . Specificity of sequence was 
addressed by sampling 1,000 random sequences of the 
same length as the native sequence (with equal probabil- 
ity of drawing any amino acid), threading each one 
through the native conformation, and obtaining from 
these the distribution of scores G seq . For each score G, 
composed of individual term scores from V(s, c) (eq. 1), 
the value of each term score was re-scaled such that the 
overall term-specific distributions of values were of 
equal range. The full distribution G was formed by the 
union of G struct and G seq for each % 

When threading random sequences, or randomly per- 
turbing the native conformation, the resulting score dis- 
tribution is noticeably skewed (data not shown). In 
particular, the Lennard-Jones potential produces a long 
tail of very large scores due to steric clashes. This is not 
unexpected since not all structures can accommodate all 
sequences [68], but it does make mean and variance 
poor estimators of location and dispersal. Fortunately, 
the probability of observing a sequence s in a particular 
conformation c follows a Boltzmann distribution: 



p(AG(s, c), T) = 



e -AG{s,c)/kT 
S .0-AG(5 /Ci )/feT 



(17) 



where i runs over all possible other conformations, 
AG is the free energy of folding, h is Boltzmann's con- 
stant and T is the absolute temperature. It follows that 
only those alternative conformations with large negative 
AG contribute greatly to this probability, or equivalently 
that only those decoys with very low scores V(s, c) (eq. 
1) are important for folding specificity. Therefore the 
median was used as a location estimator, and the differ- 
ence between the first quartile and the median as the 
estimator of dispersal. 

To choose the term weights for a particular protein- 
protein interaction, the same procedure was carried out, 
but only with respect to sequence specificity in one 
binding partner. The score distribution G seq was gener- 
ated by threading 1,000 random sequences through the 
constant conformation of one binding partner (while the 
other partner is kept constant in both sequence and 
structure), and scoring the resulting complex using V(s lf 
c h c 2 ) (eq. 10) described above. In other words, s 2 is 
randomly sampled while s lf c 2 and c 2 remain constant. 
It is then possible to calculate a gap score Z bind in the 
same manner as Zf 0 i d : 



Zbind 



{Gbind) — Gngf 
& {Gbind) 



(18) 



where G hind = G seq . Weight parameters w x can then 
be chosen using the MCMC algorithm described above 
(eq. 16). 

Sampling sequences 

Once a set of weights for V(s, c) (eq. 1) are obtained 
(the parameters U ah of E(s, c) in eq. 12 are fixed), the 
folding score gap landscape and its minima in sequence 
space were characterized for both scoring functions by 
varying 5. Sequences were sampled based on S gap (eq. 
15) both near to and far from the native state using a 
typical [69] MCMC approach, with acceptance probabil- 
ity: 



Pith) = 



1 if S gap (Sfe) > S g ap (Sfc-i) 

p{s k )-S ga p{Sk-l) _.7__ 



otherwise 



where each step k was a single substitution in the pro- 
tein sequence. 

For near-native sampling two temperatures were 
employed: a very high temperature for nearly unbiased 
sampling (T = 10 for the informational function E(s, c) 
of eq. 12, T = 50 for the physics-based function V(s, c) 
of eq. 1) and a lower temperature to find local minima 
(T = 0.1 for E(s, c), T = 1.5 for V(s, c)). The low tem- 
perature was chosen to make the chain capable of pas- 
sing the barrier in the energy landscape caused by an 
average deleterious substitution under the informational 
model (-0.1 units of E(s, c)) with a reasonably high 
probability (-0.1). This was also calibrated with respect 
to the expected fraction of sequences with better folding 
stability for real proteins and indirectly on dN/dS. The 
high temperature was chosen to make the sampling 
nearly independent of S gap (> 0.95 expected acceptance 
probability). Temperatures for the physics-based func- 
tion were then chosen such that the actual acceptance 
frequencies obtained from sampling under the informa- 
tional model were reproduced. Differences in the 
observed dN/dS ratios between the physics-based and 
informational models were due to differences in the rug- 
gedness of the landscape and its sequence context 
dependence (local ruggedness). 

To increase sample density, divergence of the chains 
from the starting state was restricted to 5%-30%, in 
increments of 5%, and two chains were run at each tem- 
perature and divergence limit. The number of attempted 
steps was adjusted to obtain -8,500 unique samples for 
each function. Samples were projected onto a two- 
dimensional representation of sequence space using 
Sammon mapping [70] and score surfaces were calcu- 
lated using Gnuplot v 4.2. 

To discover regions of the landscape where most 
mutations are destabilizing (i.e S gap is near a maximum), 
an important feature of biological protein sequences 
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leading to observed dN/dS ratios, ten replicate chains 
attempting 100,000 steps were run for each function at 
the lower sampling temperature without any divergence 
restrictions. The chains were thinned by taking every 
fourth unique sample and the final 100 such samples 
were retained, resulting in 1,000 unique samples for 
each scoring function. Samples were Sammon-mapped 
and sequence logos [71] were calculated to assess con- 
vergence. Structural properties of the sampled equilibria 
were further characterized by projecting the consensus 
sequence at each position with > 2 bits of information 
onto the protein conformation. 

Measuring scoring performance 

To test the accuracy of the scoring functions, Zf oid was 
calculated for the native sequence and 1,000 random 
sequences. The test was performed on the same diverse 
set of structures used in the development of the side 
chain replacement method employed [35]. 

The interaction score was tested by a similar scoring 
procedure. In this case a subset of the structures from 
the PepX database [72], which describes protein-peptide 
interactions, was constructed. As for the folding score 
test set, 100 structures (25 from each major SCOP 
Class) were selected, each a centroid of a cluster of 
binding interfaces at the 3 A-75% level of PepX. For 
each Class with more than 25 available centroids, the 
structures were sorted by average B-factor across the 
binding interface, and the 25 structures with the lowest 
B-factor were selected. A list of these structures can be 
found in Additional File 1. Z hind was calculated as 
described above (eq 18) for the native peptide sequence 
and a set of 1,000 random peptide sequences for each 
complex. 

In addition, the distribution of weights for each term 
in V(s, c) (eq. 1) and V(s lt c h s 2 , c 2 ) (eq. 10) across the 
structural data sets was generated to compare the rela- 
tive importance of the terms. The effect on the gap 
score of using each term individually was also examined. 
Furthermore, the numerical stability of the MCMC algo- 
rithm for choosing weights and the characteristics of the 
parameter-stability landscape were probed by producing 
eight different sets of non-native scores G for the SAP 
protein, and re-running the parameterization 100 times 
for each set. 

Simulating sequences 

As a test of the utility of the methods in an evolutionary 
context, simulations were performed under negative 
selection for protein folding and binding. In a manner 
similar to that of Rastogi et al. [11], a population of 
1,000 virtual organisms containing a single copy of the 
SAP protein [25] (PDB ID: 1D4T) was evolved for 
200,000 generations at a rate of 10" 5 mutations per bp 



per generation at the DNA level, with transitions being 
twice as probable as tranversions. To obtain a thermo- 
dynamically stable starting point, non-native optima of 
S gap (eq. 15) in protein sequence space were sampled as 
described above and a starting DNA sequence was ran- 
domly selected from the reverse translation of those 
samples. 

In each generation, the mutated DNA sequence in 
each organism was translated to protein, threaded 
through the native SAP conformation c proteim and the 
fitness calculated as 

f 0 ifS g ap{s)<S^orV bind {s)>VS 
w{s) = I 0.9 if S sap {s) > S s ^and V bind {s) < V^and V decoy {s) < Vg$ (20) 
[ 1 otherwise 

where S gap (s) is the fold gap score (eq. 15) of the cur- 
rent sequence s, S gap start is the fold gap score of the 
starting sequence, Vbind( s ) is the binding score V(s, c pro . 

teiw s ligand> c ligand) ( ec l' 10)> s ligand an d Cu gand Correspond 

to the SLAM peptide (the native ligand of SAP), V bind - 
start is the binding score of the starting protein sequence 
to the same ligand, and V dec0 y(s) is the binding score V 
(s, Cproteim s de coy> Cligand) for a decoy ligand. Simulations 
under the informational model used E(s, c) (eq. 12) to 
compute S gap and E(s h c h s 2 , c 2 ) (eq. 13) to compute 
binding scores. In other words, mutations that make the 
protein or the protein-ligand complex less stable than 
the starting point are infinitely deleterious, mutations 
that maintain those stabilities but enable binding of the 
decoy ligand are somewhat deleterious, and all other 
changes are neutral. Organisms were then propagated to 
the next generation by random sampling weighted by 
fitness w (eq. 20), with replacement, while maintaining a 
constant population size. 

The decoy ligand were constructed by threading a 
sequence s decoy = IWMTIYMIIIT through the SLAM 
peptide conformation c Ugand for the informational func- 
tion, and s decoy = RLPTIYICITG for the physics-based 
function. Both sequences are variations on the experi- 
mentally determined XXXTIYXX(VI)XX SAP-binding 
motif [25], and do not bind the protein at the starting 
point of the simulation. 

Finally, each simulation was replicated lOx, and the 
resulting sequences analyzed. Structural patterns of evo- 
lutionary rates (dN/dS) were measured by comparing 
sequences to the original sequence and calculating 
according to the PBL method [73,16], and the distribu- 
tion of position-specific residue preference was com- 
pared with that in the Pfam database [30] for the same 
fold by constructing sequence logos [71] from the 
evolved sequences. The sequences were also tested for 
emergence of rate heterogeneity by assessing the super- 
ior fit of the data to the rates-across-sites model over an 
equal-rates model using ProtTest [74]. To discern if the 
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novel solvation model preserves the general pattern of 
hydrophobic core and hydrophilic surface the propor- 
tion of hydrophobic and hydrophilic residues in parts of 
the protein were measured before and after simulation. 

Availability 

The scoring function and simulation software are avail- 
able as C++ code by request The code will be released 
as part of open source software for sequence simulation 
to be described in a future publication. 

Additional material 



Additional file 1: Table SI. The list of complexes used for testing the 
binding scoring functions is shown. The PDB accession number, the 
SCOP Class the complex belongs to, the corresponding Uniprot 
sequence (if any), and the long protein name from Uniprot (if any) are 
listed. 
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